Red Wine Data Exploration

Created by: Jie Min

Data Summary

Introduction

This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. The wines are variants of the Portuguese “Vinho Verde” wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). The data were collected in 2009 by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV).

Data Summary

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Note: The “X” field serves as the unique ID for each wine variant.

Description of attributes

1 - fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid (g / dm^3) : found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides (sodium chloride - g / dm^3): the amount of salt in the wine

6 - free sulfur dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density (g / cm^3): the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol (% by volume): the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)


Univariate Plots Section

I am mostly interested in the relationship between the chemical attributes and the quality score given by the experts. Therefore, I plan to chart each feature individually to find the interesting ones. For each feature, I will also check the quality scores of the variants with the minimum and maximum values, which may hint at whether the feature affects quality. Still, I will rely on the correlation tests later to determine whether a relationship exists.

Since I have no background in winemaking or wine composition, I will also consult references on winemaking and wine quality.
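The min/max check described above can be sketched as follows. This is a minimal Python illustration, not the code that produced the output in this report: it uses only the first ten rows shown in the str() output, and the helper name quality_at_extreme is hypothetical.

```python
# First ten rows from the str() output above (illustration only, not the full 1,599 rows).
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5]
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]


def quality_at_extreme(values, scores, pick):
    """Return the extreme value and the quality score of every row that attains it.

    `pick` is either the built-in `min` or `max`. (Hypothetical helper.)
    """
    extreme = pick(values)
    return extreme, [s for v, s in zip(values, scores) if v == extreme]


max_value, max_quality = quality_at_extreme(fixed_acidity, quality, max)
min_value, min_quality = quality_at_extreme(fixed_acidity, quality, min)

print(f"Max: {max_value} | Quality: {max_quality}")
print(f"Min: {min_value} | Quality: {min_quality}")
```

Returning a list of scores rather than a single score matters because several wines can share the same extreme value, as the citric.acid output later in this section shows.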

1. Fixed.acidity

Acidity is a fundamental property of wine, imparting sourness and resistance to microbial infection. Acids are major wine constituents and contribute greatly to its taste. In fact, acids impart the sourness or tartness that is a fundamental feature in wine taste. Wines lacking in acid are “flat.” Chemically the acids influence titratable acidity, which affects taste, and pH, which affects color, stability to oxidation, and consequently the overall lifespan of a wine. The most abundant of these acids arise in the grapes themselves and carry over into the wine. However, there are also some acids that arise as a result of the fermentation process from either yeast and/or bacteria. Traditionally total acidity is divided into two groups, namely the volatile acids (see separate description) and the nonvolatile or fixed acids. (source)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

## Max: 15.9 | Quality:  5
## Min: 4.6 | Quality:  4

Note: The middle 50% of the wines have fixed.acidity between 7.1 and 9.2. The wine with the lowest fixed.acidity has a below-average quality score of 4.

The distribution is slightly right skewed with a long right tail, and it also has some outliers.
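One way to make "some outliers" concrete is the 1.5 × IQR boxplot rule. The report does not say which rule was used, so the rule here is an assumption, and the data are only the first ten fixed.acidity values from the str() output. A minimal Python sketch:

```python
import statistics

# First ten fixed.acidity values from the str() output (illustration only).
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5]

# method="inclusive" interpolates between data points, treating the minimum
# and maximum as the 0th and 100th percentiles.
q1, median, q3 = statistics.quantiles(fixed_acidity, n=4, method="inclusive")
iqr = q3 - q1

# Assumed rule: a point is an outlier if it lies more than 1.5 * IQR
# below the first quartile or above the third quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in fixed_acidity if v < lower_fence or v > upper_fence]

# A mean that sits above the median is a quick hint of right skew.
mean = statistics.mean(fixed_acidity)

print(f"Q1={q1:.2f} median={median:.2f} Q3={q3:.2f} mean={mean:.2f}")
print("Outliers by the 1.5 * IQR rule:", outliers)
```

On the full data the same check agrees with the summary above: the mean (8.32) is larger than the median (7.90), which is consistent with a right-skewed distribution.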


2. Volatile.acidity

Wine spoilage is legally defined by volatile acidity, largely composed of acetic acid. Volatile acidity refers to the steam distillable acids present in wine, primarily acetic acid but also lactic, formic, butyric, and propionic acids. The average level of acetic acid in a new dry table wine is less than 400 mg/L, though levels may range from undetectable up to 3g/L.

The aroma threshold for acetic acid in red wine ranges from 600 mg/L to 900 mg/L, depending on the variety and style. While acetic acid is generally considered a spoilage product (vinegar), some winemakers seek a low or barely detectable level of acetic acid to add to the perceived complexity of a wine. In addition, the production of acetic acid will result in the concomitant formation of other, sometimes unpleasant, aroma compounds (see ethyl acetate and acetaldehyde). These compounds have a much lower sensory threshold than acetic acid: both acetaldehyde and ethyl acetate are detectable at less than 200 mg/L in wine. In addition to the undesirable aromas, both acetic acid and acetaldehyde are toxic to Saccharomyces cerevisiae and may lead to stuck fermentations. (source)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

## Max: 1.58 | Quality:  3
## Min: 0.12 | Quality:  7 7 7

Note: The bottle with the maximum volatile acidity has the worst quality score of 3, while the bottles with the minimum volatile acidity all have a good quality score of 7. This suggests that this feature may affect quality.

This distribution is also slightly right skewed with quite a few outliers.


3. Citric.acid

While very common in citrus fruits, such as limes, citric acid is found only in very minute quantities in wine grapes. It often has a concentration about 1/20 that of tartaric acid. The citric acid most commonly found in wine is commercially produced acid supplements derived from fermenting sucrose solutions. These inexpensive supplements can be used by winemakers in acidification to boost the wine’s total acidity. It is used less frequently than tartaric and malic due to the aggressive citric flavors it can add to the wine. When citric acid is added, it is always done after primary alcohol fermentation has been completed due to the tendency of yeast to convert citric into acetic acid. In the European Union, use of citric acid for acidification is prohibited, but limited use of citric acid is permitted for removing excess iron and copper from the wine if potassium ferrocyanide is not available. (wiki)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

## Max: 1 | Quality:  4
## Min: 0 | Quality:  5 5 5 5 7 5 5 6 6 6 6 5 5 5 5 5 6 6 5 5 6 6 6 6 6 6 6 6 5 4 5 6 6 5 5 5 6 5 5 5 3 5 5 6 5 5 5 5 6 5 6 5 7 5 5 7 5 4 6 6 6 6 7 7 6 6 6 5 7 5 5 6 6 6 5 6 6 4 6 6 6 6 6 7 5 5 5 7 6 4 4 5 5 6 6 4 4 5 6 6 5 4 5 5 6 3 6 6 6 5 5 5 5 5 4 3 6 6 6 6 6 5 6 5 6 5 5 6 4 5 5 5
## % of zero citric acid:  8.255159 %

Note: About 50% of variants have 0.26 g / dm^3 of citric acid or less; about 8% have no citric acid. There is a single outlier with a value of 1. After the sqrt transformation, the distribution looks closer to normal.
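The zero share and the sqrt transformation mentioned in the note can be sketched as below. This is a minimal Python illustration on the first ten citric.acid values from the str() output, so the printed percentage applies to that sample only, not to the 8.26% reported for the full data.

```python
import math

# First ten citric.acid values from the str() output (illustration only).
citric_acid = [0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36]

# Share of wines with no citric acid in this ten-row sample.
zero_share = 100 * sum(1 for v in citric_acid if v == 0) / len(citric_acid)

# The square root is defined at zero, so the zero rows need no special handling.
sqrt_citric = [math.sqrt(v) for v in citric_acid]

print(f"% of zero citric acid (sample): {zero_share}")
print([round(v, 3) for v in sqrt_citric])
```

A square root is a reasonable choice for this feature because a log transformation is undefined for the wines with zero citric acid.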


4. Residual.sugar

Among the components influencing how sweet a wine will taste is residual sugar. Residual sugar typically refers to the sugar remaining after fermentation stops, or is stopped, but it can also result from the addition of unfermented must (a technique practiced in Germany and known as Süssreserve) or ordinary table sugar.

Even among the driest wines, it is rare to find wines with a level of less than 1 g/L, due to the unfermentability of certain types of sugars, such as pentose. By contrast, any wine with over 45 g/L would be considered sweet, though many of the great sweet wines have levels much higher than this. For example, the great vintages of Château d’Yquem contain between 100 and 150 g/L of residual sugar. The sweetest form of the Tokaji, the Eszencia – contains over 450 g/L, with exceptional vintages registering 900 g/L. Such wines are balanced, keeping them from becoming cloyingly sweet, by carefully developed use of acidity. This means that the finest sweet wines are made with grape varieties that keep their acidity even at very high ripeness levels, such as Riesling and Chenin blanc. (source)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

## Max: 15.5 | Quality:  5
## Min: 0.9 | Quality:  6 6

Note: 75% of the variants have at most 2.6 g / dm^3 of residual sugar. No wine in the dataset would be considered sweet (> 45 g / dm^3).

The distribution is right skewed and has an extremely long tail on the right with some outliers. The quality scores for the minimum and maximum values do not differ much.


5. Chlorides

The amount of chloride in wine is influenced by both the terroir and type of grape, and the importance of quantification lies in the fact that wine flavor is strongly impacted by this particular ion, which, in high concentration, gives the wine an undesirable salty taste and significantly decreases its market appeal. (source) Governments regulate the chloride level in wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

## Max: 0.611 | Quality:  5
## Min: 0.012 | Quality:  7 7

Note: 75% of variants are low in chlorides (at most 0.09 g / dm^3). There are many outliers.


6. Free.sulfur.dioxide

Sulphur dioxide (SO2) is the most widely used and controversial additive in winemaking. Its main functions are to inhibit or kill unwanted yeasts and bacteria, and to protect wine from oxidation. Only a proportion of the SO2 added to a wine will be effective as an anti-oxidant. The rest will combine with other elements in the wine and cease to be useful. The part lost into the wine is said to be bound, the active part to be free. A good winemaker will try to get the highest proportion of free sulphur to bound that he can. At best this will be about half the amount bound. Red wines do not need any added sulphur dioxide because they naturally contain anti-oxidants, acquired from their skins and stems during fermentation. Conventional winemakers add some anyway. (source)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

## Max: 72 | Quality:  6
## Min: 1 | Quality:  6 6 6

The distribution is clearly right skewed with many outliers.

7. Total.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

## Max: 289 | Quality:  7
## Min: 6 | Quality:  6 5 5

Note: Although the reference above says that red wines do not need added sulfur dioxide, it is present in all the variants. The distribution is right skewed, and it looks more normal after applying the log transformation.
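The effect of the log transformation on skew can be checked numerically. The sketch below is a minimal Python illustration on the first ten total.sulfur.dioxide values from the str() output; the skewness function uses the plain moment definition, which is an assumption since the report judges skew visually.

```python
import math


def skewness(values):
    """Moment skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# First ten total.sulfur.dioxide values from the str() output (illustration only).
total_so2 = [34, 67, 54, 60, 34, 40, 59, 21, 18, 102]

# All values are positive, so the logarithm is defined for every row.
log_so2 = [math.log10(v) for v in total_so2]

raw_skew = skewness(total_so2)
log_skew = skewness(log_so2)

print(f"skewness before log: {raw_skew:.3f}")
print(f"skewness after log:  {log_skew:.3f}")
```

A positive value indicates a longer right tail. On this small sample the transformed values have skewness closer to zero than the raw values, which is the same direction as the effect described in the note.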


8. Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

## Max: 1.00369 | Quality:  6 6
## Min: 0.99007 | Quality:  6 6

Note: The distribution is almost symmetrical.


9. pH

The strength of acidity is measured according to pH, with most wines having a pH between 2.9 and 3.9. Generally, the lower the pH, the higher the acidity in the wine. However, there is no direct connection between total acidity and pH (it is possible to find wines with a high pH for wine and high acidity). (wiki)

The pH scale technically is a logarithmic scale that measures the concentration of free hydrogen ions floating around in your wine. The stronger the acid the more hydrogen ions you’ll have, so in essence it is a measurement of how strong an acid is. The scale used to measure pH originally went from 0 to 14 with neutral fluids being at 7.0. Young, unripe grapes have high acid levels. As the fruit ripens the acid levels decrease. An overripe grape will have very low levels of acid. Technology of Winemaking recommended pH < 3.4 for red wine. With the right acid level, which is subjective to a point, you can lock in flavors, aroma, a healthy color, and make sure your wine has good mouthfeel. Your fermentation will also go more smoothly. When the acid levels are too low your wine will lack body, the mouthfeel will be off, and it will taste weak or flabby. The wine can also pick up a brownish hue. (source)
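Because the scale is logarithmic, a small pH difference corresponds to a large difference in hydrogen ion concentration. The sketch below is a minimal Python illustration that assumes the standard definition pH = -log10([H+]), which the quoted reference does not spell out; the function name is hypothetical.

```python
def hydrogen_ion_concentration(ph):
    """Hydrogen ion concentration in mol/L, from the definition pH = -log10([H+])."""
    return 10 ** (-ph)


# Minimum and maximum pH from the summary of this dataset.
most_acidic = hydrogen_ion_concentration(2.74)
least_acidic = hydrogen_ion_concentration(4.01)

ratio = most_acidic / least_acidic
print(f"pH 2.74 vs pH 4.01: about {ratio:.1f}x the hydrogen ion concentration")
```

So although the pH values in this dataset span only about 1.3 units, the most acidic wine has roughly 19 times the hydrogen ion concentration of the least acidic one.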

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

## Max: 4.01 | Quality:  6 6
## Min: 2.74 | Quality:  4

Note: The distribution is roughly normal, with 75% of the variants having a pH of 3.4 or less, which matches the level recommended for red wine.


10. Sulphates

Sulfites occur naturally in all wines to some extent. Sulfites are commonly introduced to arrest fermentation at a desired time, and may also be added to wine as preservatives to prevent spoilage and oxidation at several stages of the winemaking. Sulfur dioxide (SO2, sulfur with two atoms of oxygen) protects wine from not only oxidation, but also from bacteria. Without sulfites, grape juice would quickly turn to vinegar.

Organic wines are not necessarily sulfite-free, but generally have lower amounts and regulations stipulate lower maximum sulfite contents for these wines. In general, white wines contain more sulfites than red wines and sweeter wines contain more sulfites than drier ones. In the United States, wines bottled after mid-1987 must have a label stating that they contain sulfites if they contain more than 10 parts per million. In the European Union an equivalent regulation came into force in November 2005. In 2012, a new regulation for organic wines came into force. (wiki)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

## Max: 2 | Quality:  4
## Min: 0.33 | Quality:  4

11. Alcohol

  • Low alcohol wine: < 10% ABV
  • Medium-low alcohol wine: 10–11.5% ABV
  • Medium alcohol wine: 11.5%–13.5% ABV
  • Medium-high alcohol wine: 13.5%–15% ABV
  • High alcohol wine: over 15% ABV
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

## Max: 14.9 | Quality:  5
## Min: 8.4 | Quality:  3 6
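The alcohol bands listed above can be applied to the summary values with a small lookup. This is a minimal Python sketch: the function name alcohol_band is hypothetical, and the list does not say which band a boundary value such as 11.5% belongs to, so assigning it to the higher band is an assumption.

```python
def alcohol_band(abv):
    """Map an alcohol percentage (ABV) to one of the bands listed above.

    Assumption: a boundary value (10, 11.5, 13.5) goes to the higher band,
    and exactly 15 stays in "medium-high" because "high" is "over 15%".
    """
    if abv < 10:
        return "low"
    if abv < 11.5:
        return "medium-low"
    if abv < 13.5:
        return "medium"
    if abv <= 15:
        return "medium-high"
    return "high"


# Minimum, median and maximum alcohol from the summary of this dataset.
for label, abv in [("min", 8.4), ("median", 10.2), ("max", 14.9)]:
    print(f"{label}: {abv}% ABV -> {alcohol_band(abv)}")
```

By these bands the median wine in the dataset is medium-low in alcohol, and no wine reaches the high band.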


12. Quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Note: More than 80% of scores are either 5 or 6.

Univariate Analysis

What is the structure of your dataset?

This dataset contains 1,599 variants of the Portuguese “Vinho Verde” red wine with 12 features (11 chemical properties and an expert quality score), plus an extra feature “X” that serves as a unique ID for each wine. All the features are continuous numerical values except quality, which is a discrete score (from 3 to 8 in this dataset).

Most of the quality scores are 5 or 6.

What is/are the main feature(s) of interest in your dataset?

I am interested in how the different features influence the quality of the wine. Among all the features, quality is the subjective rating given by the experts. The other 11 features are chemical properties of the wine, which individually or in combination may influence its quality.

I also want to see the relationship between fixed.acidity, volatile.acidity, citric.acid, and the pH level.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

According to my research, I think volatile.acidity, alcohol, fixed.acidity, and citric.acid may all influence the quality of the variants.

Did you create any new variables from existing variables in the dataset?

Yes. I created two new variables, quality.group and pH.group. quality.group has three levels: (2, 4], (4, 6], and (6, 8]. pH.group has four levels: (2.7, 3.21], (3.21, 3.31], (3.31, 3.4], and (3.4, 4.1].

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The following features have a right-skewed distribution: fixed.acidity, volatile.acidity, free.sulfur.dioxide, sulphates, and alcohol. I applied a log transformation to some of them to get a more normal distribution. Some other features are quite dispersed on the right side, with a few outliers whose values are significantly larger than the mean: citric.acid, residual.sugar, chlorides, and total.sulfur.dioxide.


Bivariate Plots Section

In this section, I will first run correlation tests for all pairs of variables. Then I will study the feature pairs with an absolute correlation value larger than 0.2.

We can see from the charts above that there is a weak negative correlation between quality and volatile.acidity (-0.39), a moderate positive correlation between quality and alcohol level (0.48), and weak positive correlations between quality and citric.acid (0.23) and sulphates (0.25).

The pH feature has a strong correlation with fixed acidity (-0.68), a moderate correlation with citric acid (-0.54), and weak correlations with density (-0.34), volatile.acidity (0.23) and chlorides (-0.27).

Besides, there are strong or moderate correlations between density and fixed acidity (0.67), fixed acidity and citric acid (0.67), volatile.acidity and citric acid (-0.55), free.sulfur.dioxide and total.sulfur.dioxide (0.67), and alcohol and density (-0.5).
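The pairwise screening described at the start of this section can be sketched as follows. This is a minimal Python illustration, assuming Pearson's r as the correlation measure; it runs on four columns and only the first ten rows from the str() output, so its values will not match the full-data coefficients quoted above.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# First ten rows of four columns from the str() output (illustration only).
columns = {
    "fixed.acidity": [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5],
    "pH": [3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5],
    "quality": [5, 5, 5, 6, 5, 5, 5, 7, 7, 5],
}

# Keep only the pairs whose absolute correlation exceeds the 0.2 cutoff.
threshold = 0.2
strong_pairs = {}
for a, b in combinations(columns, 2):
    r = pearson(columns[a], columns[b])
    if abs(r) > threshold:
        strong_pairs[(a, b)] = r

# Report the retained pairs, strongest first.
for (a, b), r in sorted(strong_pairs.items(), key=lambda item: -abs(item[1])):
    print(f"{a} vs {b}: r = {r:.3f}")
```

Filtering on the absolute value keeps strong negative relationships, such as pH against fixed.acidity, alongside the positive ones.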

Here I will only focus on the features that I am interested in to answer the questions that I have.

I divided the quality score into 3 groups: level 1 for low quality (score 3 or 4), level 2 for medium quality (score 5 or 6), and level 3 for good quality (score 7 or 8).

I also grouped pH into 4 levels by quartiles.
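The grouping can be sketched as binning into right-closed intervals. This is a minimal Python illustration, not the code behind the report: the helper name cut is hypothetical, the quality breaks 2, 4, 6, 8 are inferred from the three groups described above, and the pH breaks are the quartiles from the summary with rounded outer limits of 2.7 and 4.1.

```python
from bisect import bisect_left


def cut(value, breaks):
    """Label `value` with the right-closed interval (lo, hi] it falls into.

    `breaks` must be sorted in ascending order. Returns None when the value
    lies outside (breaks[0], breaks[-1]]. (Hypothetical helper.)
    """
    i = bisect_left(breaks, value)
    if i == 0 or i == len(breaks):
        return None
    return f"({breaks[i - 1]}, {breaks[i]}]"


quality_breaks = [2, 4, 6, 8]            # low (3-4), medium (5-6), good (7-8)
ph_breaks = [2.7, 3.21, 3.31, 3.4, 4.1]  # quartile boundaries from the summary

# First ten rows from the str() output (illustration only).
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
ph = [3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35]

for q, p in zip(quality, ph):
    print(q, cut(q, quality_breaks), "|", p, cut(p, ph_breaks))
```

With right-closed intervals a score of 4 falls in the low group and a score of 6 in the medium group, which matches the three levels described above.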

Quality vs. Volatile Acidity.

## [1] -0.3905578

We can see from this boxplot and the distribution that the good quality wines (i.e. scores of 7 and 8) have low volatile acidity, which peaks at about 0.3.

Log transform:

After the log transformation, volatile.acidity is less dispersed. We can see a clear trend in the graph: quality decreases as volatile acidity increases.

Quality vs. Alcohol

## [1] 0.4761663

Note: Even though there is a positive correlation between quality and alcohol, the wines with a quality score of 5 have a lower median alcohol value than the wines with a score of 3 or 4, which can be seen in both the boxplot and the frequency polygon. But they also have more outliers with higher alcohol levels.

Quality vs. Citric Acid

## [1] 0.2263725

Quality vs. Sulphates

## [1] 0.2513971

Cut out some outliers in sulphates, and calculate the correlation again.

## [1] 0.3937644

After cutting the outliers, the correlation increased from weak to moderate.

Now I’d like to look at the features that may influence pH level

pH vs. Fixed.acidity

## [1] -0.6829782

We can see a clear trend where the pH value decreases as fixed acidity increases.

pH vs. citric acid

## [1] -0.5419041

pH vs. density

## [1] -0.3416993

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I observed that quality decreases as volatile.acidity increases, and that quality increases with alcohol, citric acid, and sulphates. There is a weak negative correlation between quality and volatile.acidity (-0.391), a moderate positive correlation between quality and alcohol level (0.476), and weak positive correlations between quality and citric.acid (0.226) and sulphates (0.251).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I observed that the pH level decreases as fixed.acidity and citric acid increase, as expected. It also decreases as density increases. The pH feature has a strong correlation with fixed acidity (-0.683), a moderate correlation with citric acid (-0.542), and a weak correlation with density (-0.342).

What was the strongest relationship you found?

The strongest relationship that I found is the negative correlation between the pH level and fixed.acidity, which has an r value of -0.683. The strongest relationship involving quality is between quality and the alcohol level, with an r value of 0.476.

Multivariate Plots Section

Quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Positive correlation group:

alcohol, citric.acid, sulphates

## [1] 0.2421482

Negative correlation group:

volatile.acidity, total.sulfur.dioxide, density

## [1] -0.3035514

## [1] -0.1846952

Combined Group

pH

fixed.acidity, density, citric.acid, chlorides, volatile.acidity

## [1] -0.2811688

## [1] -0.6946296

## [1] -0.5264696

Others

## [1] 0.6116322

## [1] 0.4387683

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I observed that, in terms of quality, the features alcohol and sulphates strengthened each other. In other words, higher alcohol and sulphates levels are associated with better quality scores. The features alcohol and citric.acid also strengthened each other, and the wine quality increases when those features increase. In addition, another pair, sulphates and citric.acid, also strengthened each other.

I also observed that wines with higher levels of fixed.acidity, citric.acid and density have lower pH levels. Therefore, these three features also strengthened each other.

On the other hand, the volatile.acidity feature weakens the sulphates, alcohol and citric.acid features, resulting in lower quality scores.

Were there any interesting or surprising interactions between features?

The result related to volatile.acidity agrees with the wine references that I found online: it weakened the features that have positive correlations with the quality score.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

The boxplot and frequency polygons above show the relationship between quality level and volatile.acidity. The boxplot on the left shows the quality score on the x-axis and volatile.acidity on the y-axis. We can see clearly from the plot that wine variants with higher scores have lower minimum, median, and maximum volatile.acidity. The frequency polygons on the right show volatile.acidity on the x-axis and its density on the y-axis. The three lines with different colors represent three quality groups: scores 3 and 4 as low quality, 5 and 6 as medium quality, and 7 and 8 as good quality. The three lines have different peaks: the line representing good quality wine peaks at a lower volatile.acidity level than the line for low quality wine. This plot supports the winemaking information mentioned earlier, that good wine has a low volatile acidity level.

Plot Two

Description Two

In the scatter plot above, the sulphates feature is shown on the y-axis and alcohol on the x-axis. I used a sequential color scheme to represent the quality groups. It is easy to observe that sulphates and alcohol strengthened each other: the quality score increases as they increase.

Plot Three

Description Three

The scatter plots above show the relationships of wine quality with density and each of three acidity features: fixed.acidity, volatile.acidity, and citric.acid. They reveal something interesting. Even though it is not obvious that higher density leads to lower quality scores, the majority of wine variants with density lower than 0.995 g/cm^3 have good quality regardless of their fixed.acidity, citric.acid, or volatile.acidity level. For variants with higher density, we can see a clear trend and a difference between the three acidity properties. In the first two plots, at the same density, variants with higher fixed.acidity and citric.acid have higher quality scores, while in the last plot, higher volatile.acidity goes with lower quality scores.


Reflection

My major interests are the relationships of the chemical properties to quality and to pH level. During the analysis, I found that plots using the factorized quality score, especially the frequency polygons, were quite messy and did not show the trend clearly. Therefore, I grouped the quality scores into levels. My original plan was to divide the set into four groups, i.e. [3,4], 5, 6, [7,8], so that there would be similar numbers of variants in each group. However, the visual representation was still not good enough, so I regrouped the scores into three groups, [3,4], [5,6], [7,8], which show a clearer trend. I also needed to group the pH level, and I used quartile division to do it. In addition, I was wary of chemical features that are highly correlated with each other, such as fixed.acidity and density. I also left out outliers in some plots so that the bulk of the data is not overplotted and squeezed into one small space. One problem with the original dataset is that 82% of the wines received a score of 5 or 6, which limits the analysis. In the future, giving the raters a wider range of options could help when calculating the correlation of quality with the chemical features.


Reference:

This dataset is publicly available for research. The details are described in [Cortez et al., 2009].

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib